SQL PROJECT - SPATIAL DATA ANALYSIS WITH POSTGIS
1 Introduction
About this project:
As I am new to spatial data analysis, this project will only focus on myself following the instructions and exercises provided by the Introduction to PostGIS workshop. With that said, I will attempt to come up with my own questions and scenarios to apply the newly acquired knowledge. This is an ongoing project and will be updated regularly until the workshop is completed.
About the data:
The data for this workshop is four shapefiles for New York City, and one attribute table of sociodemographic variables.
- nyc_census_blocks
| Column | Description |
|---|---|
| blkid | A 15-digit code that uniquely identifies every census block. Eg: 360050001009000 |
| popn_total | Total number of people in the census block |
| popn_white | Number of people self-identifying as “White” in the block |
| popn_black | Number of people self-identifying as “Black” in the block |
| popn_nativ | Number of people self-identifying as “Native American” in the block |
| popn_asian | Number of people self-identifying as “Asian” in the block |
| popn_other | Number of people self-identifying with other categories in the block |
| boroname | Name of the New York borough. Manhattan, The Bronx, Brooklyn, Staten Island, Queens |
| geom | Polygon boundary of the block |
- nyc_neighborhoods
| Column | Description |
|---|---|
| name | Name of the neighborhood |
| boroname | Name of the New York borough. Manhattan, The Bronx, Brooklyn, Staten Island, Queens |
| geom | Polygon boundary of the neighborhood |
- nyc_streets
| Column | Description |
|---|---|
| name | Name of the street |
| oneway | Is the street one-way? “yes” = yes, “” = no |
| type | Road type (primary, secondary, residential, motorway) |
| geom | Linear centerline of the street |
- nyc_subway_stations
| Column | Description |
|---|---|
| name | Name of the station |
| borough | Name of the New York borough. Manhattan, The Bronx, Brooklyn, Staten Island, Queens |
| routes | Subway lines that run through this station |
| transfers | Lines you can transfer to via this station |
| express | Stations where express trains stop, “express” = yes, “” = no |
| geom | Point location of the station |
- nyc_census_sociodata
| Column | Description |
|---|---|
| tractid | An 11-digit code that uniquely identifies every census tract. (“36005000100”) |
| transit_total | Number of workers in the tract |
| transit_private | Number of workers in the tract who use private automobiles / motorcycles |
| transit_public | Number of workers in the tract who take public transit |
| transit_walk | Number of workers in the tract who walk |
| transit_other | Number of workers in the tract who use other forms like walking / biking |
| transit_none | Number of workers in the tract who work from home |
| transit_time_mins | Total number of minutes spent in transit by all workers in the tract (minutes) |
| family_count | Number of families in the tract |
| family_income_median | Median family income in the tract (dollars) |
| family_income_mean | Average family income in the tract (dollars) |
| family_income_aggregate | Total income of all families in the tract (dollars) |
| edu_total | Number of people with educational history |
| edu_no_highschool_dipl | Number of people with no high school diploma |
| edu_highschool_dipl | Number of people with high school diploma and no further education |
| edu_college_dipl | Number of people with college diploma and no further education |
| edu_graduate_dipl | Number of people with graduate school diploma |
2 Loading spatial data
2.1 Creating a spatial database in PostgreSQL
Select the database. Load PostGIS spatial extension to the database in PgAdmin using the following query.
CREATE EXTENSION postgis;Confim that PostGIS is installed
SELECT postgis_full_version();2.2 Importing spatial data with OGR2OGR
Since I already have QGIS installed, I will import spatial data through OGR2OGR.
In OSGeo4W Shell, navigate to the folder containing data to import and type:
ogr2ogr -nln nyc_census_blocks_2000 -nlt PROMOTE_TO_MULTI -lco GEOMETRY_NAME=geom -lco FID=gid -lco PRECISION=NO Pg:"dbname=nyc host=localhost user=postgres password=******** port=5432" nyc_census_blocks_2000.shpRemember to change the username and password.
Alternatively, data can be restored using .backup files in PgAdmin. To do so, right click on the database, then select Restore…
2.3 Trying to connect and view data with QGIS
Open QGIS, select Layer > Add Layer > Add PostGIS Layers
Click the New button, then fill out Name (nyc), Host (localhost), Database (nyc). Click OK to connect.
Once connected, select tables to add, then click Add.
Select layers to display from the Layers panel.
3 Running non-spatial queries
I will answer some questions about the datasets using SQL.
How many records are there in the nyc_census_block?
SELECT COUNT(*) FROM nyc_census_blocks;Output:
"count" 38794What is the total population of New York?
SELECT SUM(popn_total) AS total_population FROM nyc_census_blocks;Output:
"total_population" 8175032What is the population of each borough of New York?
SELECT boroname, SUM(popn_total) FROM nyc_census_blocks GROUP BY boroname;Output:
| "boroname" | "sum" | |-----------------|---------| | "Queens" | 2230621 | | "Brooklyn" | 2504700 | | "The Bronx" | 1385108 | | "Manhattan" | 1585873 | | "Staten Island" | 468730 |Summarize demographic data from nyc_census_block table.
SELECT boroname, SUM(popn_total) AS total_pop, ROUND(((SUM(popn_white) / SUM(popn_total)) * 100)::numeric, 2) AS white_pop_pct, ROUND(((SUM(popn_black) / SUM(popn_total)) * 100)::numeric, 2) AS black_pop_pct, ROUND(((SUM(popn_asian) / SUM(popn_total)) * 100)::numeric, 2) AS asian_pop_pct, ROUND(((SUM(popn_nativ) / SUM(popn_total)) * 100)::numeric, 2) AS native_pop_pct, ROUND(((SUM(popn_other) / SUM(popn_total)) * 100)::numeric, 2) AS others_pop_pct FROM nyc_census_blocks GROUP BY boroname ORDER BY 2 DESC;Output:
| "boroname" | "total_pop" | "white_pop_pct" | "black_pop_pct" | "asian_pop_pct" | "native_pop_pct" | "others_pop_pct" | |-----------------|-------------|-----------------|-----------------|-----------------|------------------|------------------| | "Brooklyn" | 2504700 | 42.80 | 34.34 | 10.47 | 0.54 | 11.85 | | "Queens" | 2230621 | 39.72 | 19.13 | 22.94 | 0.69 | 17.51 | | "Manhattan" | 1585873 | 57.45 | 15.56 | 11.32 | 0.55 | 15.13 | | "The Bronx" | 1385108 | 27.90 | 36.47 | 3.58 | 1.32 | 30.72 | | "Staten Island" | 468730 | 72.89 | 10.64 | 7.50 | 0.36 | 8.61 |How many subway stations are there in each borough in New York? Include a total number of stations to each row for comparison.
SELECT borough, (SELECT COUNT(*) FROM nyc_subway_stations) AS total_ny_stations, COUNT(name) AS stations_count FROM nyc_subway_stations GROUP BY borough;Output:
| "borough" | "total_ny_stations" | "stations_count" | |-----------------|---------------------|------------------| | "Queens" | 491 | 90 | | "Brooklyn" | 491 | 170 | | "Staten Island" | 491 | 22 | | "Manhattan" | 491 | 140 | | "Bronx" | 491 | 69 |Calculate the people per station metric for each borough.
-- Define a temporary table containing the number of stations per borough WITH stations_cte AS ( SELECT borough, COUNT(*) AS total_stations, COUNT(CASE WHEN express = 'express' THEN 1 END) AS exp_stations FROM nyc_subway_stations GROUP BY borough ) SELECT boroname, -- Calculate average population of New York AVG((SELECT SUM(popn_total) FROM nyc_census_blocks)) AS ny_populations, -- Calculate total population for each borough SUM(popn_total) AS boro_populations, -- Calculate average number of all stations per borough AVG(total_stations)::int AS stations_count, -- Calculate average people per all station ROUND(SUM(popn_total) / AVG(total_stations)) AS people_per_station, -- Calculate average number of express stations per borough AVG(exp_stations)::int AS express_stations_count, -- Calculate average people per express station (handle division by zero) ROUND(SUM(popn_total) / NULLIF(AVG(exp_stations), 0)) AS people_per_exp_station FROM nyc_census_blocks ncb LEFT JOIN stations_cte sc ON ncb.boroname = sc.borough GROUP BY boroname ORDER BY boro_populations DESC;Output:
| "boroname" | "ny_populations" | "boro_populations" | "stations_count" | "people_per_station" | "express_stations_count" | "people_per_exp_station" | |-----------------|------------------|--------------------|------------------|----------------------|--------------------------|--------------------------| | "Brooklyn" | 8175032 | 2504700 | 170 | 14734 | 31 | 80797 | | "Queens" | 8175032 | 2230621 | 90 | 24785 | 14 | 159330 | | "Manhattan" | 8175032 | 1585873 | 140 | 11328 | 62 | 25579 | | "The Bronx" | 8175032 | 1385108 | | | | | | "Staten Island" | 8175032 | 468730 | 22 | 21306 | 0 | |
4 Running geometries functions
Answer some questions about New York city streets using geometries functions.
What is the total length of all streets in New York in kilometers?
SELECT ROUND(SUM(ST_Length(geom))::numeric / 1000, 2) AS total_length_km FROM nyc_streets;Output:
"total_length_km" 10418.90How many types of streets are there and what is the total length of each type?
SELECT type, ROUND(SUM(ST_Length(geom))::numeric, 2) AS length_m FROM nyc_streets GROUP BY type ORDER BY length_m DESC;Output:
| "type" | "length_m" | |----------------------------------------------------|------------| | "residential" | 8629870.34 | | "motorway" | 403622.48 | | "tertiary" | 360394.88 | | "motorway_link" | 294261.42 | | "secondary" | 276264.30 | | "unclassified" | 166936.37 | | "primary" | 135034.23 | | "footway" | 71798.49 | | "service" | 28337.64 | | "trunk" | 20353.58 | | "cycleway" | 8863.75 | | "pedestrian" | 4867.05 | | "construction" | 4803.08 | | "residential; motorway_link" | 3661.58 | | "trunk_link" | 3202.19 | | "primary_link" | 2492.57 | | "living_street" | 1894.64 | | "primary; residential; motorway_link; residential" | 1367.77 | | "undefined" | 380.54 | | "steps" | 282.75 | | "motorway_link; residential" | 215.08 |Calculate some summary statistics about New York streets.
SELECT ROUND(SUM(ST_Length(geom))) AS total_length_m, ROUND(AVG(ST_Length(geom))::numeric, 2) AS avg_length, ROUND(percentile_cont(.5) WITHIN GROUP (ORDER BY ST_Length(geom))::numeric, 2) AS median_length, ROUND(MAX(ST_Length(geom))) AS longest_length, ( SELECT name FROM nyc_streets ORDER BY ST_Length(geom) DESC LIMIT 1 ) AS longest_street FROM nyc_streets;Output:
| "total_length_m" | "avg_length" | "median_length" | "longest_length" | "longest_street" | |------------------|--------------|-----------------|------------------|-------------------| | 10418905 | 545.75 | 269.11 | 14918 | "Leif Ericson Dr" |Find the named street of the northernmost point in New York.
Firstly, I need to find the northermost point in New York. The northermost point should be a coordinate with the highest lattitude. A street is a multilinestring type that contains multiple points. Therefore, my approach is to extract all points from all street geometries, then extract the lattitude and find the highest one. After the lattitude is found, I will find the name of the street by filtering street coordinates that contain this lattitude.
- First query: find the highest lattitude of streets in New York.
-- Dump all points from streets with non-null name. WITH dump_points_cte AS (SELECT ST_DumpPoints(geom) AS dp FROM nyc_streets WHERE name IS NOT NULL) -- Extract the highest lattitude point, i.e the northernmost point. SELECT ST_Y((dp).geom) AS lattitude FROM dump_points_cte ORDER BY 1 DESC LIMIT 1;Output:
"lattitude" 4529895.358189691- Second query: filter the street name based on the lattitude found earlier.
-- Find the non-null street that contains the highest lattitude point. SELECT name, ST_AsText(geom) FROM nyc_streets WHERE ST_AsText(geom) LIKE '%4529895.358189691%' AND name IS NOT NULL;Output:
| "name" | "st_astext" | |------------|----------------------------------------------------------------------------------------------| | "River Rd" | "MULTILINESTRING((591838.3024050258 4529895.358189691,591831.6719192386 4529886.060201272))" |- Third query: transform one point of the linestring from SRID 26918 to SRID 4326 to find the coordinates on Google Maps.
-- The WKT point coordinates are taken from one of the coordinates in the "st_astext" column. SELECT ST_AsText(ST_Transform(ST_GeomFromText('POINT(591838.3024050258 4529895.358189691)',26918),4326)) AS wgs_geom;Output:
"wgs_geom" "POINT(-73.90941589999974 40.91501400000343)"Reverse the order of the coordinates as 40.91501400000343 -73.90941589999974 (latitude, longitude), then search on Google Maps to double check.
5 Spatial relationships and spatial joins
Let’s imagine a scenario involving my friend, Fred. Fred is embarking on a month-long business trip to New York City and will be commuting to Wall Street regularly. His workplace is the renowned New York stock exchange (NYSE). He has asked for my assistance in identifying the perfect neighborhood for him to find accommodation. His key criteria include:
- The chosen neighborhood must be within a 5 km radius of Wall Street, and it must not be located in Manhattan borough.
- It should have at least 10 subway stations.
- Rank the neighborhoods based on population density in the ascending order. Fred prefers less populated areas.
5.1 Preparation
Let’s make some preparations before solving each of the criteria above.
Firstly, as there are three different Wall St in New York, I need to identify the correct Wall St in Manhattan. To do so, the “Wall St” I looking for should be within a 300 m radius of NYSE.
From Google Maps, I found the coordinate values of NYSE to be: 40.70723482298096, -74.01117110032183. My job is to transform it into SRID 26918 so that I can compare it with geometries within my dataset.
-- Transform the coordinates of NYSE to SRID 26918.
SELECT ST_AsText(
ST_Transform(
ST_GeomFromText('POINT(-74.01117110032183 40.70723482298096)', 4326),
26918
)
)Output:
'st_astext'
'POINT(583529.537283629 4506728.297607037)'Next, run a query to find all streets with the name ‘Wall St’. Then filter streets that are within a 300 m radius of NYSE.
SELECT
gid,
name,
ST_NPoints(geom),
ST_Length(geom)
FROM nyc_streets
WHERE name LIKE 'Wall St%'
AND ST_DWithin(geom, ST_GeomFromText('POINT(583529.537283629 4506728.297607037)', 26918), 300)Output:
| 'gid' | 'name' | 'st_npoints' | 'st_length' |
|-------|-----------|--------------|-------------------|
| 17385 | 'Wall St' | 9 | 603.2621123790169 |Create a table to store the geometry to use in future queries.
CREATE TABLE wall_st AS
SELECT geom AS wall_st_geom
FROM nyc_streets
WHERE gid = '17385';5.2 Identifying the perfect neighborhood
Find all neighborhood within a 5 km radius of Wall Street but doesn’t belong to Manhattan borough
It is now much more convenient to find the required neighborhoods.
SELECT name, boroname FROM nyc_neighborhoods WHERE ST_DWithin( geom, (SELECT wall_st_geom FROM wall_st), 5000 ) AND boroname != 'Manhattan'This query results in 11 neighborhoods, all in Brooklyn.
| 'name' | 'boroname' | |----------------------|------------| | 'Red Hook' | 'Brooklyn' | | 'Boerum Hill' | 'Brooklyn' | | 'Downtown' | 'Brooklyn' | | 'Greenwood' | 'Brooklyn' | | 'Carroll Gardens' | 'Brooklyn' | | 'Cobble Hill' | 'Brooklyn' | | 'Fort Green' | 'Brooklyn' | | 'Flatbush' | 'Brooklyn' | | 'Park Slope' | 'Brooklyn' | | 'Williamsburg' | 'Brooklyn' | | 'Bedford-Stuyvesant' | 'Brooklyn' |Filter neighborhoods with at least 10 subway stations
First, create a table to store the geometries of the 11 neighborhoods to make the future queries easier to read.
CREATE TABLE selected_neighborhoods AS (SELECT * FROM nyc_neighborhoods WHERE ST_DWithin( geom, (SELECT wall_st_geom FROM wall_st), 5000 ) AND boroname != 'Manhattan'); SELECT * FROM selected_neighborhoods;Output:
| 'gid' | 'boroname' | 'name' | 'geom' | |-------|------------|----------------------|------------------| | 125 | 'Brooklyn' | 'Red Hook' | '01060000202...' | | 28 | 'Brooklyn' | 'Boerum Hill' | '01060000202...' | | 34 | 'Brooklyn' | 'Downtown' | '01060000202...' | | 62 | 'Brooklyn' | 'Greenwood' | '01060000202...' | | 118 | 'Brooklyn' | 'Carroll Gardens' | '01060000202...' | | 29 | 'Brooklyn' | 'Cobble Hill' | '01060000202...' | | 94 | 'Brooklyn' | 'Fort Green' | '01060000202...' | | 47 | 'Brooklyn' | 'Flatbush' | '01060000202...' | | 54 | 'Brooklyn' | 'Park Slope' | '01060000202...' | | 59 | 'Brooklyn' | 'Williamsburg' | '01060000202...' | | 116 | 'Brooklyn' | 'Bedford-Stuyvesant' | '01060000202...' |Run a query to filter the neighborhoods with at least 10 subway stations.
SELECT sn.name, COUNT(nss.geom) FROM nyc_subway_stations AS nss JOIN selected_neighborhoods AS sn ON ST_Contains(sn.geom, nss.geom) GROUP BY sn.name HAVING COUNT(nss.geom) >= 10;Output:
| 'gid' | 'name' | 'count' | |-------|----------------------|---------| | 59 | 'Williamsburg' | 13 | | 94 | 'Fort Green' | 13 | | 116 | 'Bedford-Stuyvesant' | 13 |Rank the selected neighborhoods based on population density ascending.
SELECT sn.name, ROUND((ST_Area(sn.geom) / 1000000)::numeric, 2) AS area_km2, SUM(ncb.popn_total) AS population, ROUND(SUM(ncb.popn_total) / (ST_Area(sn.geom)/1000000)) AS popn_density_km2, RANK() OVER (ORDER BY (SUM(ncb.popn_total) / (ST_Area(sn.geom)/1000000)) ASC) FROM selected_neighborhoods AS sn LEFT JOIN nyc_census_blocks AS ncb ON ST_Intersects(sn.geom, ncb.geom) WHERE sn.gid IN ('59', '94', '116') GROUP BY sn.name, ST_Area(sn.geom);Output:
| 'name' | 'area_km2' | 'population' | 'popn_density_km2' | 'rank' | |----------------------|------------|--------------|--------------------|--------| | 'Williamsburg' | 7.61 | 135157 | 17763 | 1 | | 'Fort Green' | 5.83 | 125039 | 21451 | 2 | | 'Bedford-Stuyvesant' | 10.78 | 242871 | 22537 | 3 |Looks like Williamsburg is the best fit for my friend. However, Fort Green and Bed-Stuy also look very promising.
The calculated area, population, and population density in this project are intended for demonstration purposes only. Results may significantly differ from data available online. Use with caution.